Conversation
I have two questions:

1. The window I was talking about is this, and I think it is present in the above code:
a) The first result task failed, so the event loop thread submitted a ResubmitFailedStages message.
b) But before this message was put in the queue, a new message for a successful result task arrived.
So now the sequence of messages is: ResultTask (successful), followed by the ResubmitFailedStages message.
The event loop thread picks up the successful result task, and now there is an output available.
The event loop thread then picks up the ResubmitFailedStages message, sees stage.findMissingPartitions().length != rs.partitions.length,
and proceeds to abort the stage (which, I believe, means the query will be aborted, right?).

2. Why are you making a new stage attempt, given that there is already a committed result? And since it cannot be reverted, the query needs to abort?
(I may be wrong in my understanding of the DAGScheduler and stage code, so please bear with me.)
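To make the interleaving concrete, here is a minimal, self-contained sketch of the ordering described in a) and b). The message types and the queue are hypothetical stand-ins for the real DAGScheduler event loop, not Spark classes:

```scala
import java.util.concurrent.LinkedBlockingQueue

sealed trait Event
case class ResultTaskCompleted(partition: Int, successful: Boolean) extends Event
case object ResubmitFailedStages extends Event

object RaceWindowSketch extends App {
  val eventQueue = new LinkedBlockingQueue[Event]()
  val allPartitions = Set(0, 1)
  var completed = Set.empty[Int]

  // a) partition 0 failed, which will enqueue ResubmitFailedStages, but
  // b) the success message for partition 1 lands in the queue first:
  eventQueue.put(ResultTaskCompleted(partition = 1, successful = true))
  eventQueue.put(ResubmitFailedStages)

  while (!eventQueue.isEmpty) {
    eventQueue.take() match {
      case ResultTaskCompleted(p, true) =>
        completed += p // an output is now available on the driver's books
      case ResultTaskCompleted(_, false) => // would schedule the resubmit
      case ResubmitFailedStages =>
        val missing = allPartitions -- completed
        // mirrors: stage.findMissingPartitions().length != rs.partitions.length
        if (missing.size != allPartitions.size) {
          println("indeterminate result stage is partially complete -> abort")
        }
    }
  }
}
```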
@attilapiros: also, I am wondering if you had a clean build? Because I think when I made a similar change, I saw a valid indeterminate stage that had some partitions missing, but that code path was not from a failed task; it was a proper path.
So with this change, that path would also hit the abort.
There was one failure in Streaming and one in SQL, but both are unrelated.
1.) There is no window. b) is actually tested by your createDagInterceptorForSpark51272, isn't it?
2.) That was a mistake.
Force-pushed from 2b3fb52 to 0967f90.
It is tested in terms of retrying all partitions of the result stage, but I need to take another look, because as per my understanding, with your change the query should get aborted.
The quoted review comment was on this change in core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
```diff
@@ -1554,6 +1554,14 @@ private[spark] class DAGScheduler(
       case sms: ShuffleMapStage if stage.isIndeterminate && !sms.isAvailable =>
         mapOutputTracker.unregisterAllMapAndMergeOutput(sms.shuffleDep.shuffleId)
         sms.shuffleDep.newShuffleMergeState()
+      case rs: ResultStage if stage.isIndeterminate &&
+          stage.findMissingPartitions().length != rs.partitions.length =>
+        stage.makeNewStageAttempt(rs.partitions.length)
+        listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo,
+          Utils.cloneProperties(jobIdToActiveJob(jobId).properties)))
+        abortStage(stage, "An indeterminate result stage cannot be reverted", None)
+        runningStages -= stage
+        return
       case _ =>
```
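A hedged illustration of the new guard: for a result stage, findMissingPartitions() returns the partitions whose results the job has not received yet, so the inequality means "at least one result partition has already completed and cannot be recomputed safely". These are simplified stand-ins, not the real Spark classes:

```scala
final case class ResultStageLike(numPartitions: Int, finished: Array[Boolean]) {
  def partitions: Range = 0 until numPartitions
  def findMissingPartitions(): Seq[Int] = partitions.filterNot(finished(_))
}

object GuardSketch extends App {
  // Partition 1 already delivered its result before ResubmitFailedStages ran:
  val rs = ResultStageLike(numPartitions = 2, finished = Array(false, true))
  val partiallyComplete =
    rs.findMissingPartitions().length != rs.partitions.length
  assert(partiallyComplete) // exactly the condition that triggers the abort
}
```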
I am thinking about refactoring the tests.
Actually, we should abort even if the first task failed with a fetch failure and any successful completion follows (so ignoring it won't work). The problem is not the output registration here, but the temporary files written by the task body, or writes to an external database. When we get the Success event, those files have already been generated on the executor side.
Why abort, why not retry?
I mean, do these temporary outputs get reused? There do not seem to be any tests related to it.
Also, I do think that in the PR you have, there remains a window where the stage would abort even if the first result task has failed and retrying all partitions was possible.
But of course, if you all decide to go with aggressively aborting the stage, then the above becomes moot.
Let's assume we have a job which writes to a table stored, for example, on HDFS. Let's say the indeterminate result stage contains 2 tasks: one fetching from hostA, the other from hostB. Fetching fails from hostA, so we do our part and post the resubmit-failed-stages message. But the other task is still running, as the fetch from hostB was successful. It even finishes successfully (we do not know when this happens, as it is on the executor side). So its result will be committed with Hadoop's FileOutputCommitter (where a task can either be aborted or committed, but I do not think you can commit and then change your mind later and abort), and I do not think there is any guarantee about what will happen if you recommit that Hadoop task again. So now, as we are in an indeterminate stage and we know that what was committed is outdated, our only option is to abort the Hadoop job. But that's outside of the Spark job: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L109
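A sketch of the two-level Hadoop commit protocol being described, using the real FileOutputCommitter API but hypothetical paths and IDs; it shows why a committed task attempt has no undo:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
import org.apache.hadoop.mapreduce.task.{JobContextImpl, TaskAttemptContextImpl}

object CommitSketch extends App {
  val conf = new Configuration()
  val jobId = new JobID("20250409", 1)
  val attempt = new TaskAttemptID(new TaskID(jobId, TaskType.REDUCE, 0), 0)
  val jobCtx = new JobContextImpl(conf, jobId)
  val taskCtx = new TaskAttemptContextImpl(conf, attempt)
  val committer = new FileOutputCommitter(new Path("/tmp/example-output"), taskCtx)

  committer.setupJob(jobCtx)   // driver side, before any task runs
  committer.setupTask(taskCtx) // executor side, per task attempt
  // ... the task body writes its files under the attempt's temporary dir ...
  if (committer.needsTaskCommit(taskCtx)) {
    committer.commitTask(taskCtx) // promotes the attempt's output; one-way
  }
  // There is no "uncommitTask": once a task attempt is committed, the only
  // global remedy left to the driver is to abort the whole Hadoop job, e.g.
  // committer.abortJob(jobCtx, org.apache.hadoop.mapreduce.JobStatus.State.FAILED)
  committer.commitJob(jobCtx)  // driver side, after all tasks reported success
}
```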
The FileOutputCommitter is invoked from the driver side, right? The PR I opened has code which will not allow the result to be committed even if the task has succeeded, because a preceding task has failed.
From the driver and the executor: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L97 The task commit happens on the executor side, as I have described above.
There is an HA test which is currently disabled in the PR, but it can be enabled and run if you take both PRs (this one and 51016). That test runs for 10 minutes or so, and that can be increased. It takes data integrity into account: if the retry-all case resulted in a data integrity issue, at least one failure should have been seen.
Or maybe, is it possible to induce the test to expose the data integrity issue you are suggesting?
FileOutputCommitter is an interface and it has different implementations. One successful test would not be sufficient here. So I was thinking about how to prove this without investing a ton of extra time, and I would choose another example: writing the result via JDBC to an external DB.
Here you can see it iterates over the partitions and issues INSERT INTOs: https://github.com/attilapiros/spark/blob/554d67817e44498cca9d1a211d8bdc4a69dc9d0e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L990-L992 where each INSERT INTO is SQL dialect specific: https://github.com/attilapiros/spark/blob/554d67817e44498cca9d1a211d8bdc4a69dc9d0e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L983
So if the fetch for one host failed and the other succeeded, but you ignore the bookkeeping on the driver side and rerun all the tasks (as you treat all of them as missing after the failure), you will duplicate the rows.
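To see the duplication concretely, here is a self-contained sketch. The in-memory H2 database and the table are made up for illustration; the write path mirrors the plain per-partition INSERT-then-commit pattern described above, with no attempt ID or idempotency key:

```scala
import java.sql.DriverManager

object DuplicateRowsSketch extends App {
  val conn = DriverManager.getConnection("jdbc:h2:mem:sketch;DB_CLOSE_DELAY=-1")
  conn.createStatement().execute("CREATE TABLE target(v INT)")
  conn.setAutoCommit(false)

  def writePartition(rows: Seq[Int]): Unit = {
    val ps = conn.prepareStatement("INSERT INTO target VALUES (?)")
    rows.foreach { r => ps.setInt(1, r); ps.addBatch() }
    ps.executeBatch()
    conn.commit() // per-partition commit; nothing records which attempt wrote it
  }

  writePartition(Seq(1, 2, 3)) // the task that succeeded before the resubmit
  writePartition(Seq(1, 2, 3)) // blindly rerunning "all" partitions repeats it

  val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM target")
  rs.next()
  println(rs.getInt(1)) // prints 6, not 3: the rows are duplicated
}
```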
I will go through the example, but logically speaking the executors cannot independently commit the final results; that commit needs to be done by a coordinator, which I believe is the driver. So if the driver is not going to commit a successful result (when the first task has failed), then there should not be an issue.
Granted, one successful test is not enough, but then there is no test whatsoever to prove otherwise, at least as of now.
If a test can come through, I do believe it's time worth the effort, as it is related to data integrity.
It took me a good amount of time to prove the race as well as the basic issue of indeterminate stages not working. In fact, even when I had it in unit tests, it was not considered sufficient until actual data loss was proven.
If the assumption is that executors commit results independently, and the test accordingly fails, then what you are saying is true. But if that is not the case, then I am actually not sure what the issue is with the PR I have opened. It is a pretty ordinary change, IMO.
There are two levels: the task-level commit happens on the executor, and the driver does the job-level commit, as we have seen in the OutputCommitter case, outside of the Spark job. Just think about inserting into an external DB: we are distributed, so a session transaction is off the table.
OK, I started to run your test. (See spark/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala, lines 2186 to 2188 at 554d678.)
I will run the HA test again today to check the logging you are looking for. I have not looked into the committing portion of the Spark code on executors, but it would seem odd to me if the executor's task-level commit did not carry the attempt number (which I believe is passed to the executor as part of the task request). So if the commits (when going, say, to a DB) are written against an attempt ID, the conflict or duplicate rows should not arise.
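For reference, the attempt number is indeed visible to the task body via Spark's TaskContext, though the plain JDBC write path sketched above does not use it. This is only a sketch of the idempotency key such a sink would need, not existing Spark code:

```scala
import org.apache.spark.TaskContext

object AttemptIdSketch {
  // Returns the (partition, attempt) identity a deduplicating sink could
  // key its writes on; TaskContext.get() is non-null only inside a task.
  def writeKey(): String = {
    val ctx = TaskContext.get()
    s"partition=${ctx.partitionId()},attempt=${ctx.attemptNumber()}," +
      s"taskAttemptId=${ctx.taskAttemptId()}"
  }
}
```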
Please add a log to … And preferably your test should contain this line.